Add Megatron-Bridge recipe-free distillation example script #861

kevalmorabia97 merged 6 commits into main
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

The pull request extends the Megatron-Bridge examples with a comprehensive distillation workflow, including a new distill.py script for orchestrating student model distillation from teacher models, expanded documentation with end-to-end instructions, and minor enhancements to logging and utility scripts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as Command Line
    participant Main as main(args)
    participant HF as HuggingFace<br/>Checkpoints
    participant Bridge as AutoBridge<br/>Providers
    participant Distill as DistillationProvider
    participant Config as ConfigContainer
    participant Trainer as distill()
    CLI->>Main: Parse arguments (student/teacher HF paths, data, parallelism)
    Main->>HF: Load student & teacher checkpoints
    HF-->>Bridge: Return models
    Bridge->>Bridge: Build Megatron providers
    Bridge->>Bridge: Override parallelism & training settings
    Main->>Distill: Wrap providers with DistillationProvider
    Main->>Config: Assemble dataset, optimizer, scheduler,<br/>logging, checkpoint configs
    Config-->>Trainer: Pass ConfigContainer
    Main->>Trainer: Execute distill(config)
    Trainer->>Trainer: Create output/checkpoint directories
    Trainer->>Trainer: Run distributed training loop
    Trainer-->>Main: Report completion
    Main->>Main: Cleanup distributed environment
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main     #861      +/-   ##
==========================================
- Coverage   73.72%   73.44%   -0.28%
==========================================
  Files         196      197       +1
  Lines       20457    20657     +200
==========================================
+ Hits        15082    15172      +90
- Misses       5375     5485     +110
```
```python
print_rank_0("\nStarting distillation...")
distill(config)
```
Should we make it like the Nemo one where it can do either pretrain(), distill(), or finetune() all in one file? (@ChenhanYu would that be preferred?)
How about we incrementally extend this file as we get to needing these options?
Maybe I should rename to train.py?
I guess right now we can easily just put a pretrain() call if the KD-specific args aren't provided.
SFT can be done later since dataset/template/etc is different.
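The fallback discussed above could be sketched as a small dispatch helper. This is a sketch only: `build_parser` and `select_mode` are hypothetical names, and treating `--teacher_hf_path` as the KD-specific argument is an assumption, not the merged implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Minimal subset of the script's CLI, for illustration only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--student_hf_path", required=True)
    parser.add_argument("--teacher_hf_path", default=None)
    return parser


def select_mode(args: argparse.Namespace) -> str:
    # With no teacher there is nothing to distill from,
    # so fall back to plain pretraining of the student.
    return "distill" if args.teacher_hf_path else "pretrain"


args = build_parser().parse_args(["--student_hf_path", "Qwen3-8B-NAS-Pruned-6B"])
print(select_mode(args))  # no teacher given -> pretrain
```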
```diff
-python /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
+torchrun --nproc_per_node 1 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
```
I want to print help only on rank 0, which requires initializing the process group, and that only happens under torchrun. I am not spawning the processes in the script itself, so running it with plain `python ...` errors while trying to find the RANK env variable during distributed setup.
M-Bridge has its own print_rank_0 which accounts for that.
We should additionally change our own print_rank_0 to work without dist initialized
Our print_rank_0 works fine in a non-distributed env. The issue here is that I am manually calling dist.setup(), which fails when not launched with torchrun. Since I am doing all the low-level M-Bridge work myself (because of the lack of top-level APIs), we don't get the distributed setup from M-Bridge.
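A rank-0 print that degrades gracefully without an initialized process group could look like the sketch below. It keys off the RANK env variable that torchrun exports for each worker; the actual helpers in modelopt and M-Bridge may be implemented differently (e.g. via `torch.distributed.is_initialized()`), so treat this as an assumption-laden illustration.

```python
import os


def print_rank_0(message: str) -> None:
    """Print only on rank 0, without requiring torch.distributed to be set up.

    torchrun exports RANK for every worker; when RANK is absent we assume
    a plain single-process `python ...` run and print unconditionally.
    """
    if int(os.environ.get("RANK", "0")) == 0:
        print(message)
```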
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/megatron_bridge/README.md (1)
Line 11: ⚠️ Potential issue | 🟡 Minor
Grammatical error: "distillation" → "distilling".
"Examples of distillation a pruned or quantized model" should read "Examples of distilling a pruned or quantized model".
Proposed fix:

```diff
-| Distillation | Examples of distillation a pruned or quantized model | \[[Link](`#distillation`)\] | |
+| Distillation | Examples of distilling a pruned or quantized model | \[[Link](`#distillation`)\] | |
```
🤖 Fix all issues with AI agents
In `@examples/megatron_bridge/distill.py`:
- Around line 120-122: Make --use_mock_data and --data_paths mutually exclusive
instead of silently letting mock data win: when building the CLI parser, create
a mutually exclusive group via parser.add_mutually_exclusive_group() and add the
two flags to that group (referencing args.use_mock_data and args.data_paths),
then remove the manual validation block that raises ValueError for
neither-provided; this ensures argparse enforces exclusivity and you can keep
the later code path that reads data_paths unchanged.
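The argparse change suggested above can be sketched in isolation. The flag names are taken from the review comment; the rest of the script's parser is omitted here.

```python
import argparse

parser = argparse.ArgumentParser()
# Exactly one data source must be chosen: argparse now rejects passing
# both flags together and errors out when neither is provided.
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--use_mock_data", action="store_true")
group.add_argument("--data_paths", nargs="+")

args = parser.parse_args(["--data_paths", "/data/part1", "/data/part2"])
print(args.data_paths)
```

Passing `--use_mock_data` together with `--data_paths` now exits with a usage error instead of silently letting mock data win.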
- Around line 132-134: The code computes checkpoint_dir and tensorboard_dir in
main(args: argparse.Namespace) but never ensures they exist; add explicit
directory creation before these paths are passed into
CheckpointConfig/LoggerConfig by calling os.makedirs(checkpoint_dir,
exist_ok=True) and os.makedirs(tensorboard_dir, exist_ok=True) (ensure imports
include os if not already) so the directories derived from args.output_dir are
created ahead of use.
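The directory-creation fix can be sketched as follows, with a temp dir standing in for `args.output_dir`; the `checkpoints`/`tensorboard` subdirectory names are assumptions based on the comment above.

```python
import os
import tempfile

output_dir = tempfile.mkdtemp()  # stands in for args.output_dir
checkpoint_dir = os.path.join(output_dir, "checkpoints")
tensorboard_dir = os.path.join(output_dir, "tensorboard")

# exist_ok=True keeps this idempotent, e.g. when resuming a run
# into an output directory that already exists.
os.makedirs(checkpoint_dir, exist_ok=True)
os.makedirs(tensorboard_dir, exist_ok=True)
print(os.path.isdir(checkpoint_dir), os.path.isdir(tensorboard_dir))
```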
🧹 Nitpick comments (3)

modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

Lines 113-116: Minor: `num2hrb` on small document counts displays decimals (e.g. "5.00 docs").
When `count` is small, `num2hrb` formats it as "5.00", which reads slightly oddly for a document count. This is cosmetic and doesn't affect functionality — just worth noting if you want polished early-iteration output.

examples/megatron_bridge/README.md (1)

Lines 36-38: Hardcoded Python 3.12 path in site-packages mount is fragile.
The volume mount path `/opt/venv/lib/python3.12/site-packages/modelopt` assumes the NeMo container uses Python 3.12. This will silently break if a future container version changes the Python version. Since you pin `nemo:26.02`, this is acceptable for now, but consider adding a comment noting the Python version dependency so future maintainers know to update this path.
Suggested comment:

```diff
   -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
-  -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
+  -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \  # Update python3.12 if container Python version changes
```

examples/megatron_bridge/distill.py (1)

Lines 162-168: Hardcoded `adam_beta2=0.98` — consider exposing as a CLI arg or documenting the choice.
`adam_beta2=0.98` differs from the common default of 0.999. While 0.98 is reasonable for distillation/pre-training, it's not configurable via the CLI. A comment explaining the choice would help users who want to tune this.
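Exposing the value as a CLI argument, as suggested, could look like the sketch below; the flag name and help text are assumptions, not part of the script.

```python
import argparse

parser = argparse.ArgumentParser()
# 0.98 matches the script's current hardcoded value; the usual Adam
# default is 0.999, so the deviation is documented in --help.
parser.add_argument(
    "--adam_beta2",
    type=float,
    default=0.98,
    help="Adam beta2 (default 0.98, lower than the usual 0.999; "
    "a common choice for distillation/pre-training stability).",
)

args = parser.parse_args([])
print(args.adam_beta2)
```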
```python
dataset_kwargs = {
    "seq_length": args.seq_length,
    "path_to_cache": args.data_path_to_cache,
    "random_seed": SEED,
```
can the seed also be a random arg?
You mean randomly generated every time? Then the results may not be reproducible.
```bash
  --ulimit memlock=-1 \
  --rm -it \
  -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
  -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
```
why is mounting to venv also necessary?
So users can mount the library and the examples from the same version. This avoids the case where a user runs an old modelopt install against examples from the main branch.
To convert the Megatron checkpoint from the last iteration (or any intermediate iteration) to Hugging Face format, you need the pruned model config (`--output_hf_path` from the `prune_minitron.py` script) and the distilled Megatron checkpoint dir (`<distill_output_dir>/checkpoints/iter_<iter_number>`), then run the following command:

```bash
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
```
do we assume the user already has uv installed?
It's in the NeMo container, so it's already installed.
## What does this PR do?
**Type of change:** New example script <!-- Use one of the following:
Bug fix, new feature, new example, new tests, documentation. -->
- [x] M-Bridge recipe-free distillation script so it's easier to run and can support pruned models
- [x] Fix resuming distillation run
## Usage
<!-- You can potentially add a usage example below. -->
```bash
torchrun --nproc_per_node 8 distill.py \
--teacher_hf_path Qwen/Qwen3-8B \
--student_hf_path Qwen3-8B-NAS-Pruned-6B \
--tp_size 8 \
--data_paths <climbmix 25% tokenized (~90B tokens)> \
--data_path_to_cache /path/to/cache/climbmix_dataset_indices_qwen3 \
--seq_length 4096 \
--mbs 8 \
--gbs 768 \
--train_iters 28500 \
--lr 1e-4 \
--min_lr 1e-5 \
--lr_warmup_iters 100 \
--eval_interval 500 \
--eval_iters 32 \
--log_interval 10 \
--output_dir qwen3_8b_6b_mbridge_distill
```
## Testing
<!-- Mention how have you tested your change if applicable. -->
- [x] Re-ran Qwen3 8B -> 6B experiments and compared with NeMo2 results from the blog
Best subnet from NAS: `{'num_layers': 30, 'hidden_size': 3584,
'ffn_hidden_size': 11776} -> 5.99B params, 0.5718 score`
| Model | MMLU | GSM8K - flexible, strict | MBPP (coding) |
| ------- | ------ | ------- | ------- |
| Qwen3-8B | 74.9 | 87.5, 84.6 | 65.4 |
| Qwen3-8B-Pruned-6B | 57.6 | 11.6, 10.0 | 4.8 |
| Qwen3-8B-Pruned-6B (Distilled for 16k steps, i.e. 50B tokens, ~3k GPU hours) | 71.6 | 78.0, 64.7 | 43.4 |
| Qwen3-8B-Pruned-6B (Distilled for 28.5k steps, i.e. 90B tokens, ~5.2k GPU hours) | 71.9 | 78.1, 64.8 | 44.2 |
| Qwen3-4B | 70.0 | 81.1, 84.7 | 62.8 |
Previous NeMo2 experiments on depth-pruned Qwen3 8B -> 6B (24 layers) had MMLU ~72.0, so the results are more or less similar. No hyperparameter tuning was done for the current M-Bridge distillation run.
- [ ] (Separate PR) GitHub CI/CD test for example script with NeMo 26.02
container
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes <!--- If No, explain why.
-->
- **Did you write any new necessary tests?**: N/A
- **Did you add or update any necessary documentation?**: Yes
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes <!--- Only for new features, API changes, critical bug fixes or bw
breaking changes. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added complete distillation workflow and example for Megatron-Bridge
optimization.
* **Documentation**
* Enhanced setup guide with Docker workflows, data preparation steps,
and detailed distillation instructions.
* Improved usage documentation and help references.
* **Improvements**
* Better data preprocessing output with human-readable formatting for
metrics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>